An Algorithm for Estimating the Parameters of Unrestricted Hidden Stochastic Context-Free Grammars
Abstract
INTRODUCTION This paper describes an iterative method for estimating the parameters of a hidden stochastic context-free grammar (SCFG). The "hidden" aspect arises from the fact that some information is not available when the grammar is trained. When a parsed corpus is used for training, production probabilities can be estimated by counting the number of times each production is used in the parsed corpus. In the case of a hidden SCFG, the characteristic grammar is defined but the parse trees associated with the training corpus are not available. To proceed in this circumstance, some initial probabilities are assigned which are iteratively reestimated from their current values and the training corpus. They are adjusted to (locally) maximize the likelihood of generating the training corpus. The EM algorithm (Dempster, 1977) embodies the approach just mentioned; the new algorithm can be viewed as its application to arbitrary SCFGs. The use of unparsed training corpora is desirable because changes in the grammar rules could conceivably require manually reparsing the training corpus several times during grammar development. Stochastic grammars enable ambiguity resolution to be performed on the rational basis of most likely interpretation. They also accommodate the development of more robust grammars having high coverage, where the attendant ambiguity is generally higher. Previous approaches to the problem of estimating hidden SCFGs include parsing schemes in which all derivations of all sentences in the training corpus are enumerated (Fujisaki et al., 1989; Chitrao & Grishman, 1990). An efficient alternative is the Inside/Outside (I/O) algorithm (Baker, 1979) which, like the new algorithm, is limited to cubic complexity in both the number of nonterminals and the length of a sentence. The I/O algorithm requires that the grammar be in Chomsky normal form (CNF).
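As a concrete illustration of the supervised case described above, the following sketch estimates production probabilities from a parsed corpus by counting how often each production is used and normalizing per left-hand-side nonterminal. The representation (each parse tree given as a list of `(lhs, rhs)` productions) is a hypothetical one for illustration; in the hidden case these observed counts are replaced by expected counts that are reestimated iteratively.

```python
from collections import Counter, defaultdict

def estimate_from_parsed_corpus(parse_trees):
    """Maximum-likelihood production probabilities from a parsed corpus:
    count uses of each production, then normalize per nonterminal."""
    counts = Counter()
    for tree in parse_trees:
        for production in tree:  # each tree given as a list of (lhs, rhs) pairs
            counts[production] += 1
    totals = defaultdict(int)
    for (lhs, rhs), c in counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}

# Toy parsed corpus of two trees, each flattened to its productions
trees = [
    [("S", ("NP", "VP")), ("NP", ("she",)), ("VP", ("runs",))],
    [("S", ("NP", "VP")), ("NP", ("he",)), ("VP", ("runs",))],
]
probs = estimate_from_parsed_corpus(trees)
```

Here `S -> NP VP` gets probability 1.0 (it is the only S production observed), while the two NP expansions split the NP counts evenly.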
The new algorithm has the same complexity, but does not have this restriction, dispensing with the need to transform to and from CNF. TERMINOLOGY The training corpus can be conveniently segmented into sentences for purposes of training; each sentence comprises a sequence of words. A typical one may consist of Y + 1 words, indexed from 0 to Y. The lookup function W(y) returns the index k of the vocabulary entry v_k matching the word w_y at position y in the sentence. The algorithm uses an extension of the representation and terminology used for hidden Markov models (hidden stochastic regular grammars), for which the Baum-Welch algorithm (Baum, 1972) is applicable (and which is also called the Forward/Backward (F/B) algorithm). Grammar rules are represented as networks and illustrated graphically, maintaining a correspondence ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 / PROC. OF COLING-92, NANTES, AUG. 23-28, 1992
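The lookup function W(y) defined above can be sketched as follows. This is a minimal illustration under an assumed representation (a vocabulary list whose positions are the indices k, and a sentence as a list of word strings); the paper itself does not prescribe the data structures.

```python
def make_lookup(vocabulary):
    """Build W: sentence position y -> index k of the vocabulary entry v_k
    matching the word w_y (hypothetical list-based representation)."""
    index = {word: k for k, word in enumerate(vocabulary)}
    def W(sentence, y):
        return index[sentence[y]]
    return W

vocab = ["the", "cat", "sat"]          # v_0, v_1, v_2
W = make_lookup(vocab)
sentence = ["the", "cat", "sat"]       # words w_0 .. w_Y with Y = 2
```

With this setup, W(sentence, 1) returns 1, the index of vocabulary entry "cat".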
Similar resources
Stochastic Tree-Adjoining Grammars
ABSTRACT The notion of stochastic lexicalized tree-adjoining grammar (SLTAG) is defined and basic algorithms for SLTAG are designed. The parameters of a SLTAG correspond to the probability of combining two structures, each one associated with a word. The characteristics of SLTAG are unique and novel since it is lexically sensitive (as N-gram models or Hidden Markov Models) and yet hierarc...
A Trellis-Based Algorithm For Estimating The Parameters Of A Hidden Stochastic Context-Free Grammar
INTRODUCTION The algorithm described in this paper is concerned with using hidden Markov methods for estimation of the parameters of a stochastic context-free grammar from free text. The Forward/Backward (F/B) algorithm (Baum, 1972) is capable of estimating the parameters of a hidden Markov model (i.e. a hidden stochastic regular grammar) and has been used with success to train text ...
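The forward pass of the F/B algorithm mentioned in this abstract can be sketched in a few lines. This is a minimal, hypothetical HMM formulation (initial distribution `pi`, transition matrix `A`, emission matrix `B`, integer observations), not the paper's own notation; it returns the total likelihood of the observation sequence.

```python
def forward(pi, A, B, obs):
    """Forward pass of the Forward/Backward algorithm for an HMM:
    alpha[t][i] = P(o_0..o_t, state_t = i). Returns P(obs)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[j] * A[j][i] for j in range(n)) * B[i][o]
                      for i in range(n)])
    return sum(alpha[-1])
```

For a two-state model with uniform transitions and emissions, any two-symbol observation sequence has likelihood 0.25.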
Weakly Restricted Stochastic Grammars
A new type of stochastic grammars is introduced for investigation: weakly restricted stochastic grammars. In this paper we will concentrate on the consistency problem. To find conditions for stochastic grammars to be consistent, the theory of multitype Galton-Watson branching processes and generating functions is of central importance. The unrestricted stochastic grammar formalism generates the s...
Using evolutionary Expectation Maximization to estimate indel rates
MOTIVATION The Expectation Maximization (EM) algorithm, in the form of the Baum-Welch algorithm (for hidden Markov models) or the Inside-Outside algorithm (for stochastic context-free grammars), is a powerful way to estimate the parameters of stochastic grammars for biological sequence analysis. To use this algorithm for multiple-sequence evolutionary modelling, it would be useful to apply the ...
Learning Stochastic Context-Free Grammars from Corpora Using a Genetic Algorithm
A genetic algorithm for inferring stochastic context-free grammars from finite language samples is described. Solutions to the inference problem are evolved by optimizing the parameters of a covering grammar for a given language sample. We describe a number of experiments in learning grammars for a range of formal languages. The results of these experiments are encouraging and compare very favour...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
Publication date: 1992